feat(query): add field existence, regex, +/- prefix, boost, and ? wildcard query support#39

Draft
Copilot wants to merge 3 commits into main from copilot/add-lucene-query-support

Conversation

Contributor

Copilot AI commented Mar 3, 2026

Extends the Lucene-style query parser with the missing syntax features requested in the issue: field:* existence, field:/regex/, required/prohibited prefixes, boost acceptance, and ? single-character wildcards. Also fixes bare quoted phrases ("connection refused") which previously searched with the quotes included.

What

Go query engine (pkg/query/lucene.go)

  • ExistenceFilter: field:* matches entries where the field is present
  • RegexFilter: field:/regex/ applies a compiled regexp.Regexp; extractRegex() handles patterns containing ( or ) that the token reader would otherwise cut short
  • + prefix consumed as a no-op (default AND); - prefix wraps the next filter in NotFilter
  • stripBoost() strips trailing ^n from tokens — accepted for syntax compatibility, ignored for filtering
  • ? wildcard in WildcardFilter: detection extended to ContainsAny("*?"), converted to . in the regex
  • Bare quoted phrases strip the quotes before KeywordFilter (previously searched with literal " characters)
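
As a rough illustration of two of the items above, here is a minimal sketch of the ? and * wildcard conversion and of boost stripping. It is not the code from pkg/query/lucene.go: wildcardToRegex is a hypothetical helper name, the anchoring with ^ and $ is an assumption, and this stripBoost is a reimplementation that only mirrors the behavior described in the list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wildcardToRegex (hypothetical helper) turns a Lucene-style wildcard pattern
// into an anchored regexp: '*' becomes ".*", '?' becomes ".", and every other
// character is quoted so it matches literally.
func wildcardToRegex(pattern string) (*regexp.Regexp, error) {
	var b strings.Builder
	b.WriteString("^")
	for _, r := range pattern {
		switch r {
		case '*':
			b.WriteString(".*")
		case '?':
			b.WriteString(".")
		default:
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

// stripBoost drops a trailing ^n suffix (n made of digits and dots) and
// returns every other token unchanged, including regex literals like /^api/.
func stripBoost(token string) string {
	i := strings.LastIndexByte(token, '^')
	if i <= 0 || i == len(token)-1 {
		return token
	}
	sawDigit := false
	for _, r := range token[i+1:] {
		switch {
		case r >= '0' && r <= '9':
			sawDigit = true
		case r != '.':
			return token // not a numeric boost, leave the token alone
		}
	}
	if !sawDigit {
		return token
	}
	return token[:i]
}

func main() {
	// Wildcard detection as described above: any '*' or '?' in the value.
	fmt.Println(strings.ContainsAny("req-????", "*?"))

	re, err := wildcardToRegex("req-????")
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("req-1234"), re.MatchString("req-12345"))
	fmt.Println(stripBoost("error^2"), stripBoost("/^api/"))
}
```

Quoting every non-wildcard character keeps values such as api.v2* from being read as regex syntax.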

UI syntax highlighter (pkg/server/index.html)

  • New --peek-purple CSS var + .hl-regex class for regex literals
  • Word token reader gains inRegex state so ( / ) inside /regex/ don't terminate the token
  • +/- prefixes emitted as hl-op in both field-scoped and bare-keyword positions
  • tokenizeValue treats ? as a wildcard char alongside *

Tests

  • Unit tests for all new filter types and parser behaviors
  • e2e/lucene-query.spec.mjs — 8 Playwright tests: field existence, regex with alternation, FTS keyword, quoted phrase, +/- combined, wildcard, and UI highlighting

Why

The parser silently dropped several Lucene query constructs that users expect to work (field:*, field:/regex/, +term, -term, term^n), and bare quoted phrases matched nothing because the surrounding " were included in the keyword search string.

Related Issue

Closes #9

Testing Done

  • All Go unit tests pass (go test ./... -race -count=1)
  • All 8 new E2E tests pass; existing search.spec.mjs unaffected
  • go vet ./... clean; CodeQL: 0 alerts
Original prompt

This section details the original issue you should resolve

<issue_title>Add indexed query support for field existence, regex, and full-text search (FTS) while staying local-first</issue_title>
<issue_description>## Summary

Implement Lucene-style query string support for Peek so users can query logs with syntax that is as close to Lucene QueryParser as possible, focusing on:

  • Full-text search (FTS) on message and selected fields (analyzed)
  • Field existence queries (field:*)
  • Regex queries (field:/.../)
  • Wildcards (*, ?), phrases ("..."), required/prohibited (+, -), boosting (^), and boolean logic

Maintain local-first, single-binary distribution and the no-build-step UI model.

Motivation

Peek currently supports a small Lucene-like subset evaluated via scanning (with time-range key seeking). Users want Lucene-like expressiveness, specifically:

  • field:* existence
  • field:/regex/
  • real full-text search behavior (analysis/tokenization), not substring contains

This needs to work both for querying historical logs and for realtime filtering in the UI.

Goals

  1. Accept Lucene-style query string syntax in the UI and API, staying as close to Lucene QueryParser as practical.
  2. Add FTS with an analyzer-driven inverted index (default field behavior like Lucene).
  3. Add field existence query semantics compatible with Lucene (field:*).
  4. Add regex query semantics compatible with Lucene query string (field:/.../).
  5. Keep single binary, local-only, embedded UI in pkg/server/index.html, no new frontend dependencies, immutable VanJS updates.
  6. Add Playwright E2E tests for the new query features.

Non-goals (for this issue)

  • Remote collectors or multi-user deployments
  • Distributed search or external services
  • Full Solr/Elasticsearch feature parity (faceting, aggregations, scoring explanations, etc.)
  • Perfect Lucene scoring parity (ranking differences are acceptable; correctness of filtering is the priority)

Proposed approach (recommended)

Use an embedded Go search index to avoid implementing a full Lucene parser + inverted index from scratch.

Recommendation:

  • Use Bleve's query string query support as the parsing and execution engine for Lucene-like syntax.
  • Keep BadgerDB as the source of truth for stored log entries.

Rationale:

  • Query string syntax supports phrases, field scoping, regex, required/excluded operators, and boosting.
  • Bleve supports query types we need (regexp, wildcard, fuzzy, numeric/date ranges, query string).
  • Keeps local-first and single-binary (just adds a Go dependency and an on-disk index directory).

User-visible query syntax (Lucene-style)

Default field behavior (FTS)

  • Unfielded terms query the default field (configurable), recommended default: message (and optionally a composite field).
    • timeout refused
    • "connection refused"

Field scoping

  • service:api-gateway
  • level:ERROR

Field existence (Lucene semantics)

  • request_id:*
  • user_id:*

Semantics: field is present and has at least one term indexed.

Regex (Lucene query string style)

  • service:/^api-(gateway|edge)$/
  • user_id:/^usr-[0-9]{4}$/

Semantics: regex applies to indexed terms for that field.
Important note:

  • For keyword fields (not analyzed), the term is the full field value, so regex behaves like "regex over the full value".
  • For analyzed fields (like message), regex is term-level, not substring-over-full-text, consistent with Lucene behavior.

Wildcards

  • service:api*
  • request_id:req-?????? (if ? is supported)
  • message:*timeout* (term-level wildcard implications apply)

Boolean and required/prohibited clauses

  • level:ERROR AND service:api
  • +level:ERROR -service:auth

Boosting

  • error^2 timeout

Architecture changes

Storage remains unchanged

  • BadgerDB key format remains: log:{timestamp_nano}:{id}
  • LogEntry JSON stays as-is.

Add embedded index

Introduce an index directory (default under Peek data dir):

  • ~/.peek/index (or ${db_path}/index)

Add configuration:

  • [search] enabled = true|false (default false initially)
  • [search] index_path = "~/.peek/index"
  • [search] default_field = "message"
  • [search] include_in_all = ["message", "raw"] (optional)
  • [search] field_mapping_mode = "dynamic|strict"

CLI flags:

  • --search (enable embedded index)
  • --search-index-path
  • --search-default-field

Index document model

Index one document per log entry with a stable doc id:

  • docID = "{timestamp_nano}:{id}"
  • Badger key can be derived: log:{timestamp_nano}:{id}
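
The mapping between the index doc id and the Badger key is a plain prefix operation. A minimal sketch, assuming the key format stated above; the function names are hypothetical and not part of Peek:

```go
package main

import (
	"fmt"
	"strings"
)

// logKeyPrefix is the prefix from the key format log:{timestamp_nano}:{id}.
const logKeyPrefix = "log:"

// makeDocID (hypothetical) builds the stable doc id "{timestamp_nano}:{id}".
func makeDocID(timestampNano int64, id string) string {
	return fmt.Sprintf("%d:%s", timestampNano, id)
}

// badgerKeyFor (hypothetical) derives the storage key from a doc id.
func badgerKeyFor(docID string) []byte {
	return []byte(logKeyPrefix + docID)
}

// docIDFromKey (hypothetical) reverses the mapping; ok is false when the key
// does not carry the expected prefix.
func docIDFromKey(key []byte) (docID string, ok bool) {
	s := string(key)
	if !strings.HasPrefix(s, logKeyPrefix) {
		return "", false
	}
	return strings.TrimPrefix(s, logKeyPrefix), true
}

func main() {
	id := makeDocID(1700000000000000000, "abc123")
	key := badgerKeyFor(id)
	back, ok := docIDFromKey(key)
	fmt.Println(id, string(key), back, ok)
}
```

Because the doc id is the key minus its prefix, a search hit can be resolved to the stored entry with a single point lookup.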

Indexed fields (suggested):

  • timestamp (datetime)
  • level (keyword)
  • message (text, analyzed)
  • raw (text or keyword, optional)
  • fields.* (dynamic)
    • strings: keyword by default
    • numbers: numeric
    • booleans: boolean
    • optional: allow marking specific fields as analyzed text via config (eg fields.stacktrace)

Query execution path

When search index is enabled:

  • `/query...

Copilot AI and others added 2 commits March 3, 2026 01:24
…uery support

Co-authored-by: mchurichi <633182+mchurichi@users.noreply.github.com>
…review

Co-authored-by: mchurichi <633182+mchurichi@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add indexed query support for field existence and regex" to "feat(query): add field existence, regex, +/- prefix, boost, and ? wildcard query support" Mar 3, 2026
@mchurichi
Owner

@codex review this

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf09af7297

Comment on lines +236 to +237
if len(value) >= 2 && value[len(value)-1] == '/' {
	return value[1 : len(value)-1]

P2 Badge Continue parsing regex after escaped trailing slash

extractRegex currently treats any token ending in / as a complete regex literal, but readToken can stop early at ( or ). For a valid query like path:/foo\/(bar|baz)/, the partial token seen here is "/foo\\/"; this branch returns foo\\ and regexp.Compile fails, so valid regex queries are rejected whenever an escaped slash appears before a parenthesized part.
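
One possible direction, sketched under assumptions rather than taken from the PR: scan the raw query from the opening / and skip any backslash-escaped character, so that only an unescaped / ends the literal. regexLiteralEnd is a hypothetical helper name, and how it would plug into readToken and extractRegex is left open.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// regexLiteralEnd (hypothetical helper) returns the index of the unescaped
// '/' that closes a regex literal opening at s[start], or -1 if start is not
// a '/' or the literal is unterminated.
func regexLiteralEnd(s string, start int) int {
	if start >= len(s) || s[start] != '/' {
		return -1
	}
	for i := start + 1; i < len(s); i++ {
		switch s[i] {
		case '\\':
			i++ // skip the escaped character, including an escaped '/'
		case '/':
			return i
		}
	}
	return -1
}

func main() {
	query := `path:/foo\/(bar|baz)/`
	start := strings.IndexByte(query, ':') + 1
	end := regexLiteralEnd(query, start)
	if end < 0 {
		panic("unterminated regex literal")
	}
	body := query[start+1 : end]
	re, err := regexp.Compile(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(body, re.MatchString("foo/bar"))
}
```

Scanning the raw input this way also lets ( and ) pass through without special handling, since only an unescaped / can end the literal.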

Comment on lines +1877 to +1878
if (c === ':' && j + 1 < text.length && text[j + 1] === '/') {
	inRegex = true

P3 Badge Limit regex-mode detection to field values that start with '/'

The highlighter switches to regex mode on any :/ sequence, even when it appears later inside a normal value. Queries such as url:https://example.com therefore get split as if https:/ started a regex literal, producing incorrect tokenization/highlighting for common URL filters. This should only trigger for the field/value delimiter followed immediately by /.
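
The narrower check can be sketched as follows. The highlighter itself is JavaScript in pkg/server/index.html; Go is used here only to stay consistent with the parser, and startsRegexValue is a hypothetical name. The sketch assumes the field/value delimiter is the first colon in the token.

```go
package main

import (
	"fmt"
	"strings"
)

// startsRegexValue (hypothetical) reports whether a field:value token has a
// value that begins with '/', looking only at the first ':' in the token.
// Later ":/" sequences, as in URLs, are ignored.
func startsRegexValue(token string) bool {
	i := strings.IndexByte(token, ':')
	return i > 0 && i+1 < len(token) && token[i+1] == '/'
}

func main() {
	fmt.Println(startsRegexValue("service:/^api-(gateway|edge)$/")) // regex value
	fmt.Println(startsRegexValue("url:https://example.com"))        // plain value containing ":/"
	fmt.Println(startsRegexValue("level:ERROR"))                    // plain value
}
```

A bare https://example.com with no field prefix would still read as field https with a value starting with /, so unfielded URLs need separate handling.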
